First Steps with TensorFlow

Learning Objectives:

Learn fundamental TensorFlow concepts
Use the LinearRegressor class in TensorFlow to predict median housing price, at the granularity of city blocks, based on one input feature
Evaluate the accuracy of a model's predictions using Root Mean Squared Error (RMSE)
Improve the accuracy of a model by tuning its hyperparameters
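
As a quick illustration of the RMSE metric named above, here is a minimal plain-Python sketch. The function name rmse and the sample values are hypothetical and chosen only for illustration; they are not taken from the dataset or from any TensorFlow API.

```python
import math


def rmse(predictions, targets):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    # Square each prediction error, average the squares, then take the root.
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values in thousands of dollars, for illustration only.
example_targets = [120.0, 490.0, 180.0]
example_predictions = [130.0, 470.0, 180.0]
example_rmse = rmse(example_predictions, example_targets)
```

Because the errors are squared before averaging, RMSE is expressed in the same units as the target and penalizes large errors more heavily than small ones.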



In [4]:

    
# Load the necessary libraries
import math

from IPython import display
from matplotlib import cm, gridspec, pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format



In [122]:

    
# Load the dataset
california_housing_dataframe = pd.read_csv("https://storage.googleapis.com/mledu-datasets/california_housing_train.csv", sep=",")

We'll randomize the data to avoid any pathological ordering effects that might harm the performance of Stochastic Gradient Descent. We'll also scale median_house_value to units of thousands, so that it can be learned more easily with learning rates in the range we typically use.



In [123]:

    
california_housing_dataframe = california_housing_dataframe.reindex(np.random.permutation(california_housing_dataframe.index))
california_housing_dataframe["median_house_value"] /= 1000
california_housing_dataframe









Out[123]:

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
11692     -121.3      38.1                17.0       3507.0           696.0      1867.0       709.0            3.2               120.7
15861     -122.4      37.8                52.0       2088.0           487.0      1082.0       488.0            2.7               490.0
4185      -118.0      33.8                25.0       3179.0           639.0      2526.0       623.0            3.3               180.8
10693     -120.6      38.8                22.0       1236.0           273.0       615.0       248.0            3.0               106.9
7468      -118.4      34.2                34.0       1471.0           423.0       995.0       386.0            3.0               188.7
...          ...       ...                 ...          ...             ...         ...         ...            ...                 ...
15467     -122.3      37.6                52.0       2351.0           494.0      1126.0       482.0            4.0               356.9
4466      -118.0      34.1                37.0       1275.0           177.0       598.0       174.0            7.2               500.0
6590      -118.3      34.0                41.0       1933.0           791.0      3121.0       719.0            1.9               147.5
5181      -118.1      33.9                38.0       1475.0           269.0       827.0       265.0            4.8               191.6
11108     -121.0      37.7                27.0       2278.0           479.0       995.0       449.0            2.5               110.2

17000 rows × 9 columns

Examine the Data

It's a good idea to get to know your data a little bit before you work with it.

We'll print out a quick summary of a few useful statistics on each column: count of examples, mean, standard deviation, max, min, and various quantiles.



In [124]:

    
california_housing_dataframe.describe()









Out[124]:

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value
count    17000.0   17000.0             17000.0      17000.0         17000.0     17000.0     17000.0        17000.0             17000.0
mean      -119.6      35.6                28.6       2643.7           539.4      1429.6       501.2            3.9               207.3
std          2.0       2.1                12.6       2179.9           421.5      1147.9       384.5            1.9               116.0
min       -124.3      32.5                 1.0          2.0             1.0         3.0         1.0            0.5                15.0
25%       -121.8      33.9                18.0       1462.0           297.0       790.0       282.0            2.6               119.4
50%       -118.5      34.2                29.0       2127.0           434.0      1167.0       409.0            3.5               180.4
75%       -118.0      37.7                37.0       3151.2           648.2      1721.0       605.2            4.8               265.0
max       -114.3      42.0                52.0      37937.0          6445.0     35682.0      6082.0           15.0               500.0

Build the First Model

In this exercise, we'll try to predict median_house_value, which will be our label (sometimes also called a target). We'll use total_rooms as our input feature.

NOTE: Our data is at the city block level, so this feature represents the total number of rooms in that block.

To train our model, we'll use the LinearRegressor interface provided by the TensorFlow Estimator API. This API takes care of a lot of the low-level model plumbing, and exposes convenient methods for performing model training, evaluation, and inference.

Step 1: Define Features and Configure Feature Columns

In order to import our training data into TensorFlow, we need to specify what type of data each feature contains. There are two main types of data we'll use in this and future exercises:

Categorical Data: Data that is textual. In this exercise, our housing data set does not contain any categorical features, but examples you might see include the home style or the words in a real-estate ad.
Numerical Data: Data that is a number (integer or float) and that you want to treat as a number. As we will discuss later, sometimes you might want to treat numerical data (e.g., a postal code) as if it were categorical.

In TensorFlow, we indicate a feature's data type using a construct called a feature column. Feature columns store only a description of the feature data; they do not contain the feature data itself.

To start, we're going to use just one numeric input feature, total_rooms. The following code pulls the total_rooms data from our california_housing_dataframe and defines the feature column using numeric_column, which specifies its data is numeric:



In [132]:

    
# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]

# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

Step 2: Define the Target

Next, we'll define our target, which is median_house_value. Again, we can pull it from our california_housing_dataframe:



In [133]:

    
# Define the label
targets = california_housing_dataframe["median_house_value"]

Step 3: Configure the LinearRegressor

Next, we'll configure a linear regression model using LinearRegressor. We'll train this model using the GradientDescentOptimizer, which implements Mini-Batch Stochastic Gradient Descent (SGD). The learning_rate argument controls the size of the gradient step.
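
To make the role of learning_rate concrete, here is a minimal plain-Python sketch of a single mini-batch gradient step for a one-feature linear model (prediction = weight * x + bias) with mean squared error as the loss. The function name sgd_step and the two-example batch are hypothetical, and the loss scaling is an assumption; LinearRegressor performs its updates internally and may differ in detail.

```python
def sgd_step(weight, bias, batch_x, batch_y, learning_rate):
    """Apply one gradient descent update using the mean squared error of a batch."""
    n = len(batch_x)
    # Prediction errors for the current parameters.
    errors = [(weight * x + bias) - y for x, y in zip(batch_x, batch_y)]
    # Gradients of mean((weight * x + bias - y)^2) with respect to weight and bias.
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, batch_x)) / n
    grad_b = 2.0 * sum(errors) / n
    # Step against the gradient, scaled by the learning rate.
    return weight - learning_rate * grad_w, bias - learning_rate * grad_b


# Hypothetical batch: total_rooms values and targets in thousands of dollars.
batch_x = [3507.0, 2088.0]
batch_y = [120.7, 490.0]
new_w, new_b = sgd_step(0.0, 0.0, batch_x, batch_y, learning_rate=0.0000001)
```

In this sketch the weight gradient is scaled by feature values in the thousands, so even a very small learning rate moves the weight noticeably. That is one plausible reason for starting with a learning rate as small as the one used in this exercise.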

NOTE: To be safe, we also apply gradient clipping to our optimizer via clip_gradients_by_norm. Gradient clipping ensures that the magnitude of the gradients does not become too large during training, which can cause gradient descent to fail.
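
The idea behind clipping by norm can be sketched in plain Python as follows. This is an illustration only, assuming the clip is applied to the combined L2 norm of all gradients; the function name clip_by_norm_sketch is hypothetical and is not part of TensorFlow.

```python
import math


def clip_by_norm_sketch(gradients, clip_norm):
    """Scale gradients so that their combined L2 norm does not exceed clip_norm."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm <= clip_norm:
        # Already within the limit: leave the gradients unchanged.
        return list(gradients)
    # Shrink every gradient by the same factor, preserving the direction.
    scale = clip_norm / total_norm
    return [g * scale for g in gradients]


# Hypothetical gradients with L2 norm 50.0, clipped to a norm of 5.0.
clipped = clip_by_norm_sketch([30.0, 40.0], clip_norm=5.0)
```

Because every component is scaled by the same factor, clipping changes only the length of the step, not its direction.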



In [134]:

    
# Use gradient descent as the optimizer for training the model.
# Set a learning rate of 0.0000001 for Gradient Descent.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
# Clip gradients to a maximum norm of 5.0.
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

# Configure the linear regression model with our feature columns and optimizer.
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)

Step 4: Define the Input Function

To import our California housing data into our LinearRegressor, we need to define an input function, which instructs TensorFlow how to preprocess the data, as well as how to batch, shuffle, and repeat it during model training.

First, we'll convert our pandas feature data into a dict of NumPy arrays. We can then use the TensorFlow Dataset API to construct a dataset object from our data, and then break our data into batches of batch_size, to be repeated for the specified number of epochs (num_epochs).

NOTE: When the default value of num_epochs=None is passed to repeat(), the input data will be repeated indefinitely.

Next, if shuffle is set to True, we'll shuffle the data so that it's passed to the model randomly during training. The buffer_size argument specifies how many elements of the dataset are held in a buffer, from which shuffle randomly samples.

Finally, our input function constructs an iterator for the dataset and returns the next batch of data to the LinearRegressor.
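
The batch, repeat, and shuffle steps described above can be sketched in plain Python. This is an analogy for illustration, not the TensorFlow Dataset API: the generator name input_fn_sketch, its seed parameter, and the sample values are hypothetical, and the buffered shuffle is a simplified model of how a shuffle buffer behaves.

```python
import itertools
import random


def input_fn_sketch(features, targets, batch_size=1, shuffle=True,
                    num_epochs=None, buffer_size=10000, seed=None):
    """Yield (feature_batch, target_batch) pairs: batch, repeat, then shuffle."""
    rng = random.Random(seed)

    def batches():
        # Break the data into consecutive batches of batch_size.
        for start in range(0, len(targets), batch_size):
            stop = start + batch_size
            batch = {name: values[start:stop] for name, values in features.items()}
            yield batch, targets[start:stop]

    def repeated():
        # num_epochs=None repeats the data indefinitely.
        epochs = itertools.count() if num_epochs is None else range(num_epochs)
        for _ in epochs:
            yield from batches()

    if not shuffle:
        yield from repeated()
        return

    # Hold up to buffer_size batches and emit a randomly chosen one each time.
    buffer = []
    for item in repeated():
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: emit whatever remains in random order.
    rng.shuffle(buffer)
    yield from buffer


# Hypothetical data shaped like the dict of arrays described above.
features = {"total_rooms": [3507.0, 2088.0, 3179.0, 1236.0]}
targets = [120.7, 490.0, 180.8, 106.9]

one_epoch = list(input_fn_sketch(features, targets, batch_size=2,
                                 shuffle=False, num_epochs=1))
# With num_epochs=None the generator never ends, so take only a few batches.
first_three = list(itertools.islice(
    input_fn_sketch(features, targets, batch_size=2, shuffle=False), 3))
```

Note that a small buffer_size only shuffles locally; a buffer at least as large as the dataset is needed for a uniformly random order.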


